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Abstract. The multispecies coalescent model describes the generation 
of gene trees from a rooted metric species tree, and thus provides a 
framework for the inference of species trees from sampled gene trees. 
We prove that the STAR method of Liu et al., and generalizations of 
it, are statistically consistent methods of topological species tree infer- 
ence under this model. We discuss the impact of gene tree sampling 
schemes for species tree inference using generalized STAR methods, and 
reinterpret the original STAR as a consensus method based on clades. 



1. Introduction 

While not always emphasized in practical analyses of biological sequences, 
the gene trees which standard phylogenetic methods infer may differ from the 
species tree relating the taxa from which the samples are taken. A frequent 
reason for this incongruence between gene trees and species trees is the popu- 
lation genetic effect of incomplete lineage sorting. Although this source of in- 
congruence has been recognized for many years, recently methods have been 
proposed to bring a statistical treatment of it into data analysis, through the 

framework of the multispecies coalescent model dRannala and Yang . 20031; 

Liu and Pearl. l2007l:lKubatko et all 120091 : iLiu et al.l . l2009l : lHeled and Drummond 
20101 : lLarget et all ^Olof T However, the subject is still young, with both 
theoretical issues concerning some of the methods not fully established, and 
practical understanding of their appropriate application not yet developed. 

Here we focus on the STAR method, of ILiu et al.1 (j2009l h for inferring; 
a rooted topological species tree from rooted topological gene trees. It is 
a fast approach to species tree construction tha t shows promise of good 
performan c e bot h in simulations ( Liu et al. . 20091 ) and in empirical studies 



(ILee et all l2012l h 



Since STAR bases its inference on topological gene trees, it discards all 
metric information one might have on them. While at first this may seem 
wasteful, there are good reasons one might prefer such an approach. In 
practice, the gene trees are likely to be themselves inferred by standard 
phylogenetic methods, so their edge lengths are often considered to be less 
reliably known than their topological structure. Moreover, modeling the evo- 
lution of sequences on gene trees that were produced under the coalescent 
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model requires reconciling time units on gene trees (reflecting total amounts 
of substitution), with time units on the species tree (reflecting population 
size and true time). This is usually resolved in part by assuming a molecu- 
lar clock, thus forcing gene trees to be ultrametric. Since variations in base 
substitution rates along gene lineages can be large, and inferred metric gene 
trees are often far from ultrametric, it is unclear what impact this approach 
has on data analysis. The authors of the Bayesian inference software BEST 
have, in fact, war ned that violations of the mol ecular clock could have 'sig- 
nificant impacts' ^Castillo-Ramirez etal.l .B. By using only topological 



gene tree topologies, STAR circumvents such problems. 

STAR proceeds by first combinatorially encoding each topological gene 
tree by a distance matrix. Averaging these over all gene trees produces a 
distance matrix that is then used to build a species tree. Finally, the metric 
information on the i nferred sp e cies t ree is discarded, so only its topology 
is retained. Though iLiu et al. (120091 1 suggest UPGMA or Neighbor Join- 



ing (NJ) for this step, in fact any distance method, including ones based 
on optimality criteria such as Minimum Evolution, could be used. That 
an algorithmic tree construction process such as Neighbor Joining is fast 
accounts only in part for the speed of STAR. Its avoidance of calculating 
the theoretical gene tree probabilities that would be needed in either Maxi- 
mum Likelihood or Bayesian analysis is also a major factor in its speed, so 
use of other distance methods can still be expected to have computational 
advantages. 



Although not investigated by Liu et al. ( 20091 ). it is a rather surprising 



fact that the expected distance matrix obtained by STAR applied to the 
theoretical gene tree distribution under the multispecies coalescent model 
will exactly fit a rooted tree with the same topology as the species tree. 
That this is true is crucial to the claim that STAR is a statistically consistent 
method of inference. In this work we prove this claim, and as a consequence 
establish the statistical consistency of the STAR method when any of a large 
class of distance methods are used for the final selection of the species tree. 

In the course of doing this, we also show that the method can be gener- 
alized into a parameterized family of STAR methods that are all consistent. 
The parameters specify a node numbering scheme which controls the way in 
which gene trees are encoded in distance matrices. (Although the particular 
no de numbering sc heme of the original method is referred to as a 'ranking' 
by lLiu et al. (l2009h . we avoid that terminology as it conflicts with the com- 



mon use of the term 'ranked tree' to indicate that internal nodes are assigned 
distinct values, often to indic ate the temporal orderin g of branching events 
in different parts of the tree (ISemple and Steel liooah .) The original STAR 



method arises from one particular choice of these parameters. 

Our work also shows that under the coalescent the expected distance 
tables obtained in all generalized STAR methods are ultrametric. This is 
important for understanding the behavior of UPGMA, Neighbor Joining, or 
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any distance method when applied to the ex pected distance table, a point 
that was not addressed by Liu et al. ( 20091 ) . It also suggests that in the 



final selection of the species tree from the average distances, it might be 
preferable to use a method that enforces ultrametricity. 

In closing, we show that the standard STAR method can be viewed as 
using only the clade structure of the gene trees. This observation allows a 
reinterpretation of the STAR distance as one by which taxa are viewed as 
close if they appear together in many clades across the gene trees. The STAR 
method can thus be viewed as a type of greedy consensus method using the 
clade structure of gene trees to infer the species tree. That it is a consensus 
method attuned in particular to the species tree/gene tree problem is the 
import of the fact that STAR is consistent under the coalescent model. 

Although our emphasis in this paper is on theory, we also discuss some 
possible implications for data analysis, focusing on various sampling schemes 
and how generalizations of the STAR method might possibly be used to gain 
a measure of confidence in species tree estimates. Additional work is needed 
to further develop the suggestions here. 



2. The standard STAR Node Numbering Scheme and Variants 

The encoding of an n-taxon binary rooted gene tree topology by a distance 
matrix in STAR is accomplished through first associating numbers to each 
node of the tree. The standard node numbering scheme used by STAR is as 
follows: Assign n to the root of the tree. Then if an internal node has been 
assigned a number m, assign to each of its children that are internal nodes 
the number m — 1. These numbers are then interpreted as distances from 
the leaves in an ultrametric tree, so that the distance between two leaves is 
twice the number assigned to their most recent common ancestor. In Figure 
[TJ for example, the distance between genes A and C is 6 on the left tree, 
and 10 on the right tree. 

An alternative viewpoint is that the standard STAR metrization of a gene 
tree is obtained from the rooted gene tree topology by assigning all internal 
edges length 1, and then choosing lengths of pendant edges so that all leaves 
are distance n from the root. 

It is immediately clear that some trivial variations on this numbering 
scheme will have no effect on the output of STAR. For instance, suppose 
instead of assigning the root the number n, one assigned to it any number i > 
n— 2, but otherwise followed the standard scheme. Then all internal branches 
would still have length 1, and pendant ones would have non- negative length, 
regardless of the topology of the gene tree. Indeed pendant edge lengths 
would only change by the addition of £ — n. The net effect on the pairwise 
distance matrix for each gene tree is to add 1{l—n) to each off-diagonal entry. 
Thus the empirical average of these matrices over a sample of gene trees, 
and the theoretical expected distance matrix change in the same way. (Note 
that these distance matrices will still have non- negative off-diagonal entries.) 
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Figure 1. Two 6-taxon gene trees with a STAR node- 
numbering. Both trees are drawn ultrametricly according 
to the standard STAR numbering with (00,01,02,03,04) = 
(6,5,4,3,2). Note that only the caterpillar tree shape (left) 
has distinct numberings for each node. For a generalized 
node numbering, the a% can be any non-negative decreasing 
sequence with at least two distinct terms. Node numbers 
are interpreted as remetrizing the gene tree, by specifying 
distances of the nodes from their descendant leaves. 



It is easy to see that this will have no effect on the topology of the species 
tree output by methods such as UPGMA and NJ. Similarly, assuming one 
chooses a sufficiently large number to assign to the root, one could number 
descendant nodes by any constant decrement from their parents, with no 
effect on the output of the topological species tree by UPGMA or NJ, since 
this would merely change the average distance matrix by a rescaling and 
addition of a constant. 

Thus assuming the output of standard STAR from the theoretical gene 
tree distribution does agree topologically with the species tree, there are 
certainly some variations on the numbering scheme that will have the same 
property. Less obvious variations of node numberings that can also be used 
for species tree inference in a generalized STAR method are the focus of the 
next section. 

By way of contrast, one might consider the following seemingly natural 
node numbering: Assign to each node the count of its leaf descendants. 
That this 'descendant-count' numbering scheme behaves differently from 
the STAR scheme is seen by considering the 5-taxon caterpillar species tree 



°" = (i((a,b):x,c):y,d):z,e), 
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where x,y,z are in coalescent units. Taking x = z = oo and y = 0, the only 
gene trees with non-zero probability under the coalescent model are 

T 1 = ((((a,b),c),d),e), P CT (Ti) = l/3, 

T 2 = ((((a,b),d),c),e), P CT (T 2 ) = l/3, 

T 3 = (((a,b),(c,d)),e), P CT (T 3 ) = l/3. 

Using these probabilities as weights in averaging the distance matrices 
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This expected distance matrix exactly fits the rooted tree 

(((a:2, 6:2):5/3, (c:10/3, d:10/3):l/3):4/3, e:5) 

which is topologically different from the species tree, even if both are treated 
as unrooted. Similar examples with positive and finite edge lengths can 
easily be constructed, by taking x, z to be large while y is small, since the 
expected distance matrix depends continuously on the edge lengths. 

If, instead, the standard STAR numbering were used in the above exam- 
ple, .Dti and Dt 2 would be unchanged, but in Dt 3 all entries of 4 would be 
replaced by 6. How this effects E CT (i>r), and thus changes the tree it fits, 
we leave as an easy exercise for the reader. 



3. Consistent node numbering schemes 

We next show that any node numbering scheme with several simple prop- 
erties, when used in STAR with UPGMA, NJ, or any well-behaved distance 
method for the selection of the species tree, leads to consistent inference. 
By 'well-behaved' we simply mean one that when applied to any distance 
table in some neighborhood of one exactly fitting T will return the topology 
of T . 

The properties we require for a node numbering scheme are: 
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(1) The number assigned to any internal node on a gene tree depends 
only on the count of nodes between it and the root in the topological 
gene tree. Thus the node numbers on all gene trees come from a 
common node numbering sequence ao,a±,..., a n _2, with ao assigned 
to all roots, a\ to their children, and more generally Oj to all internal 
nodes at depth i from the root. 

(2) A non-negative number is assigned to every node, with the number 
assigned to a child always less than or equal to that assigned to 
its parent, with at least one instance of strict inequality. That is, 
a-0 > ai > • • • > a n „2 > ctn-l = with some ai > a^ + i for some I. 

Equivalently, one could define an edge length sequence bx, 62, . . . , &n-i> 
where > 0, and at least one 6j > 0, and then assign length b{ to all 
internal edges at depth i from the root. If a pendant edge is at depth i, it is 
assigned the length X^/>j &r This is related to a node numbering sequence 
by defining bi = aj_i — a\. 

Both of these properties are natural. Property ([2]) ensures all internal 
branch lengths are non-negative, with at least one positive, while property 
([I]) ensures a connection to the number of coalescent events in a gene's 
lineage that occur above (i.e., temporally before) the specified node. 

Note that the standard STAR scheme is simply the special choice of the 
node numbering sequence n, n — 1, n — 2, . . . , 2, or, equivalently, of the 
edge length sequence 1,1,1,..., 2. On the other hand, the descendant- 
count scheme mentioned in the previous section fails to satisfy property 
([1]). Though it is tied to counting coalescent events, that scheme counts 
events below the node. 

We consider first a generalized STAR method where for each gene ex- 
actly one individual is sampled per taxon. Extensions to sampling multiple 
individuals will be discussed later. 



Definition 3.1. Anxn numerical matrix, with rows and columns indexed 
by an n-element set of taxa X, weakly fits a topological tree tp on X if 
there is some assignment of non- negative lengths to the edges of ip such that 
the resulting pairwise tree distances between leaves give the entries of the 
matrix. If the edge lengths are positive, then we say the matrix strongly fits 
ip. If the assignment of edge lengths yields an ultrametric tree, we say the 
matrix fits tp ultrametricly. 

For a complete presentation of the multispecies coalescent model, which 
describes the generation of ge ne trees from a metri c species tree, we refer 
to earlier works, for instance ( Allman et al. . 201 lbl ). The primary feature 



of the coalescent model that we use is the exchangeability of lineages: for 
a collection of lineages present in a population at a particular time, the 
probability of any pattern of coalescence (moving backwards in time) of 
these lineages is the same for all permutations of the lineages. 
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We denote a species tree on a set of taxa X by a = (ip,A), where tp 
is a rooted topological tree and A is a vector of edge lengths measured 
in coalescent units. We use lower case letters k to denote taxa, and 
the corrresponding upper case letters /, J, K to denote gene samples from 
those taxa. For any y C X, by the most recent common ancestor of y, 
MRCA(3^), we mean the vertex on ip that is ancestral to all elements of y, 
and a descendant of any other vertex ancestral to all of y. Thus y is always 
a subset of the set of descendants of MRCA(3'). An MRCA on a gene tree 
is defined similarly. 

Our main result is the following. 

Theorem 3.2. Under the multispecies coalescent model on a metric species 
tree a = (tp, A), let P a (T) denote the probability of a rooted topological gene 
tree T. For a node numbering sequence satisfying properties (pQ) and ([2]), let 
Dt denote the pairwise distance matrix obtained from applying the sequence 
to T. Then for all Aj G (0, oo) the expected value of this matrix, 

T 

is a pairwise distance matrix that strongly fits ip ultrametricly. 



Of course the pairwise distance matrix E CT (i>r) of this theorem does not 
generally match the pairwise distances on the species tree a. 

Proof of Theorem \3.2[ For a binary topological species tree ip with positive 
edge lengths Aj, it is enough to show that, for any three taxa i,j,k, if ip 
displays the rooted triple ((i,j), k), then with D = E cr (L'r), 

D(i,j)<D(i,k) = D(j,k). 

This is sim ply the 3-point condition for a distance to determine an ultra- 
metric tree ( Semple and Steel . 120031 ) . 



So suppose tp displays ((i,j), k), and let w = MRCA({i, j, k}) on a. Using 
shorthand such as C(I, J) < w to mean the coalescence of the (gene) lineages 
of / and J occurs below node w in the species tree, and F a (T, C(I, J) < w) 
to mean the joint probability of the gene tree T and the event C(I, J) < w, 
we see 

D(i,j)=J2^(T)D T (i,j) 

T 

(1) =Y,^(T,C(I,J)<w)D T (i,j) + Y J ^(T,C(I,J)>w)D T (i,j). 

T T 

But we claim 

(2) J> CT (T,C(/,J) <w)D T (i,j) <^P CT (T,C(J,J) <w)D T (i,k). 

T T 

To see this, first note that if the lineages of /, J coalesce below w and a gene 
tree T is observed, then T must display the rooted triple ((I,J),K). But 
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for such gene trees, Dx(i,j) < Dx{i,k). Thus a weak version of inequality 
(|2|) holds termwise. 

Note now that under the coalescent model on a species tree with finite 
edge lengths, if C(I, J) < w, then every labeled gene tree T which displays 
((I,J),K) is realizable. In particular, since ag > ai + i, and there is a gene 
tree displaying ((I, J), K) for which MRCA(I, J) is at depth l + l or greater, 
and the MRCA(I, K) is at depth I or less, one has DT(i,j) < Dx(i,k) for 
some gene tree T with P cr (T, C(I, J) < w) > 0. Thus at least one pair of 
corresponding terms in inequality ([2]) satisfies a strict inequality, and ([2]) is 
established. 

We claim also that 

(3) £> CT (T,C(/,J) >w)D T (i,j) = ^F a (T,C(I, J) >w)D T (i,k). 

T T 

Indeed, up to multiplication by a positive constant, these sums respectively 
give the conditional expectation of D(i,j) and D(i,k), given that the lin- 
eages of /, J, K are distinct at w. But by property ([1]) of the numbering 
scheme, and the exchangeability of lineages under the coalescent, these con- 
ditional expectations are the same. 

From equations (P), ([2]), and Q the inequality D(i,j) < D(i,k) now 
follows. 

To see that D(i, k) = D(j, k), observe 
D(i,k) = Y,F tT (T)D T (i,h) 

T 

= J2^a(T,C(I, J) < w)D T (i,k) +^P CT (r,C(/,J) > w)D T (i,k). 

T T 

However 

F a (T, C(I, J) < w)D T (i, k) = P a (T, C(I, J) < w)D T (j, k), 

since if the probability appearing in this equation is non-zero then the MRCA 
of /, K on T is the same as the MRCA of J, K. Moreover, 

C(I, J) > w)D T (i, k)=J2 V*(T, C(I, J) > w)D T {j, k) 

T T 
by an exchangability argument similar to that used in establishing inequality 
([3]). Thus D(i,k) = D(j,k), and the 3-point condition is established. 

For non-binary ip, one must also check that if tp displays the unresolved 
3-taxon subtree (i,j,k), then 

D(i,j)=D(i,k)=D(j,k). 

This is done using exchangability of lineages, as in the justification for equa- 
tion ©. □ 

Corollary 3.3. Consider the generalized STAR method using any node 
numbering sequence satisfying properties ([T]) and ([2]), and any method of 
tree selection that from a distance table strongly fitting a binary tree returns 
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that tree, and whose output is continuous at such tables. Then under the 
multispecies coalescent model on a = (ip,A), with ip binary and \i > 0, the 
method is statistically consistent for inferring the species tree topology ip. 

Proof. Let N be the number of sampled gene trees, and e > 0. By the 
law of large numbers, as N — > oo, the probability that the average distance 
matrix differs from E a (Dx) by more than e in any entry approaches 0. 
Thus by continuity of the distance method of tree selection, as N — > oo the 
probability that the inferred tree will have the same topology as that from 
E CT (i>r), ip, approaches 1. □ 



lAttesonl (119971 ) showed that the Neighbor Joining tree construction al- 



gorithm satisfies the continuity hypothesis of this corollary. A proof that 
UPGMA is continuous is straightforward. 

Example 3.4. As an extreme example of a node numbering scheme leading 
to consistent inference, for n species let cti = 2 for < i < n—2 and a n _2 = 1. 
Interpreting the values as distances from the leaves, this scheme converts 
every non-caterpillar gene tree into a completely unresolved tree, and every 
caterpillar gene tree into a gene tree with one non-trivial clade of two leaves. 
In effect, the resulting generalized STAR method discards all information 
gathered from non-caterpillar gene trees yet, by Theorem I3.2( the species 
tree topology can still be recovered. Thus it is possible, in principle, to 
reconstruct the species tree topology using only caterpillar gene trees as 
input, even if the species tree is not a caterpillar. Of course, we do not 
recommend this for practical inference. 

It is natural to focus especially on node numbering sequences that are 
strictly decreasing. Indeed, if all branch lengths on a = {ip, A) are very 
long, then with high probability a finite sample of topological gene trees 
will include only those matching ip. If a node numbering is not strictly 
decreasing, however, then an estimate of E CT (Z?r) from the sample would 
give D^, which may only weakly fit ip, leading to a poorly-resolved inferred 
species tree under STAR. 

The next theorem shows that strictly decreasing node numbering se- 
quences lead to better behavior of generalized STAR methods, in the sense 
that longer internal branches on the species tree do not make it more difficult 
to infer a fully-resolved species tree from a fixed sample size of gene trees. 
Since longer edges in a species tree increase the probability of the gene trees 
showing the associated split, this is intuitively desirable, and would certainly 
be a useful characteristic for data analysis. 

Recall now that the probability of any gene tree under the coalescent is 
a polynomial in the transfor med species tree bran ch lengths, exp(— Aj), as 
is explained, for instance, by Allman et al.l (j2011bl ). Thus for any collection 



{Dt} of matrices associated to gene trees (whether or not it arises from a 
node numbering sequence), the expected value E cr (Dr) also has entries that 
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are polynomial in the exp(— Aj). As a result, the expected value also makes 
sense for a branch length of oo, by setting the transformed branch length 
to 0. (Treating a branch length of oo in this way is equivalent to taking 
a limit as the branch length goes to oo.) While an infinite branch length 
of course has no real biological meaning, we will allow branch lengths in 
(0, oo] = (0, oo) U {oo}, both to simplify the presentation of arguments, and 
to easily describe behavior as branches grow long. 

Note also that the strong fitting of tp by K a (Dx) for finite A is expressible 
by equalities and strict inequalities in the matrix entries, arising from the 
3-point conditions. By continuity then, if some Aj — > oo the same equali- 
ties and non-strict versions of the inequalities must hold. That is, if branch 
lengths are in (0, oo], then E CT (Z>r) will certainly weakly fit tp. By requir- 
ing that the node numbering is strictly decreasing, however, this can be 
improved. 

Theorem 3.5. Suppose a node numbering sequence is strictly decreasing. 
Then for any ip and all Aj £ (0,oo], the expected distance matrix ¥ 1(T (Dt) 
strongly fits tp, ultrametricly. 

Moreover, if the node numbering sequence is not strictly decreasing, then 
there is a binary ip and choices of Aj £ (0, oo] for which E (J (Dt) only weakly 
fits ip. 

Proof. The proof of the first claim follows the argument of Theorem 13.21 
The primary modification is in the justification of inequality ([2]), when some 
of the internal edge lengths are Aj = oo. But since a strictly decreasing 
node numbering implies DT(i,j) < DT(i,k) for all gene trees T displaying 
((/, J),K), the inequality is immediately clear. 

For the second claim, suppose ap = ap + i and pick any binary species tree 
V> which has an internal edge between nodes u of depth I and v of depth 
i + 1. Assign edge lengths of oo to all edges incident to the ancestors of 
u, to the edge descending from u that is not incident to v, and to the two 
edges descending from v. Pick taxa i, j that are descendants of v through 
the two different edges, and taxon k that is a descendant of u but not of 
v. Then regardless of the other edge lengths, the only gene trees realizable 
under the coalescent will be those with one of MRCA(J, J), MRCA(J, K), 
and MRCA( J, K) at depth £+ 1, say MRCA(I, J), and thus MRCA(I, K) = 
MRCA(J, K) located at depth t as the parent of MRCA(I, J). Since a e = 
a i+1 , we have E a (D T )(i,j) = E a (D T )(i,k) = E a (D T )(j,k) = 2a £ are all 
equal, so E fT (I?r) cannot strongly fit tp. □ 

Remark. While we have established that properties (pQ) and ([2]) are sufficient 
for statistical consistency of STAR, we have not shown they are necessary. 
In fact, addressing this question for the full STAR process which includes the 
selection of the species tree by some algorithm or optimality criterion would 
be difficult, since even if an expected distance matrix does not exactly fit a 
tree, the fitting processes in the distance method may overcome the misfit. 
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Avoiding the impact of the final tree selection step, one can investigate 
the existence of alternative schemes in which distance matrices might be 
assigned to gene trees in such a way that for any species tree the expected 
distance matrix under the coalescent fits the topological species tree. Pre- 
liminary investigations indicate that at least for small trees such assignments 
exist that do not arise from generalized STAR numbering schemes. 



The theorems above were stated and proved under a gene tree sampling 
scheme in which one individual is sampled per taxon. We next consider 
extensions to multi-individual sampling. 

In the case of a sampling scheme in which ki individuals are sampled from 
taxon i for every gene, a simple device for extending STAR is the extended 
species trees: At leaf i on the original species tree attach k{ descendant edges 
in a multifurcation, leading to new taxa ij, 1 < j < fe;. These new pendant 
edges need not be assigned lengths, but all edges arising from the original 
species tree retain their lengths. 

The coalescent model on the original species tree with multiple individuals 
sampled per taxon then produces exactly the same distribution of gene trees 
as the coalescent on the extended species tree with one sample per taxon. 
Thus Theorem 13.21 applies. 

Since this extended species tree is not binary if any ki > 2, Corollary 



13.31 does not apply directly. However, the procedure proposed by iLiu et al 



( 20091 ) for multi- individual sampling is to first average over these individuals 
so that a distance matrix is obtained relating the original taxa. We sketch 
an argument that this leads to consistent inference. Since the average dis- 
tance table for the gene trees will approximate the expected one, which by 
Theorem 13.21 strongly fits the extended species tree ultrametrically, so will 
the result of further averaging this empirical distance table over its image 
under all permutations of individuals within each taxon. In this averaged 
table, distances between individuals in different taxa depend only on the 
two taxa, and not the individuals. Deleting all but one individual per taxon 
from this table will then yield a table that in expectation strongly fits the 
origina l species t r ee, an d whose entries are those calculated by the proce- 
dure of Liu et al. ( 20091 ). Thus as long as the original species tree is binary, 



one still has a consistent inference scheme. For an example of an empirical 
study tha t uses STAR with a variable number of individuals per species, see 
Lee et al.1 (|2012h . 



An even more general sampling scheme might vary the number of indi- 
viduals sampled by both taxon and gene. For instance if taxon i is sampled 
kij times for locus j, we could define a sampling scheme for locus j as 
kj = (ky, . . . , k n j), where n is the number of species. Letting /C = {k} be 
the finite set of all possible schemes used in a particular study design, if each 
locus has a sampling scheme independently chosen from some distribution 
on IC, then as the number of loci approaches infinity, the average empirical 
distance table for each sampling scheme approaches the expected distance 
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table for that sampling scheme by the Law of Large Numbers. Thus, the 
expected distance table over K, is a weighted average over the expected dis- 
tance tables for each k € /C, and hence also ultrametrically fits the species 
tree by the 3-point condition. Thus the multiple-allele version of STAR is 
consistent method of inference even with different numbers of alleles sampled 
at different loci. 

Finally, one could generalize further to allow different STAR node num- 
berings for different sampling schemes. More precisely, if k, £ K, specifies 
both a sampling scheme and numbering scheme for locus j, chosen according 
to some distribution on /C, then the STAR method will still be statistically 
consistent. Of course with such an approach an increased number of loci 
are likely to be needed for observed average distances to be close to their 
expected values. 

4. Finite samples of loci 

The statistical consistency of STAR methods, as established in the pre- 
ceding section, is a statement about asymptotic behavior as the number of 
sampled loci approaches infinity. In this section we collect some observations 
on possible pitfalls of naively applying STAR when the number of sampled 
loci is small. 

We begin with issues concerning multiple samples of individuals within 
taxa, where the number of individuals varies with the locus. 

First, imagine a rather extreme sampling scheme in which N loci are 
sampled, but that at locus j, taxon i is sampled from i + j individuals. 
Thus each sampling scheme is used for exactly one locus. The statistical 
consistency of STAR in this situation is not helpful to us, since without 
multiple uses of every sampling scheme we do not expect to see asymptotic 
behavior. The Law of Large Numbers does not apply, and the observed 
average distance table might be quite different from its expectation. 

If in a real study the number of sampling schemes used at different loci 
is very large in comparison to the number of loci, an appeal to the Law of 
Large Numbers may be similarly questionable. For example, if each of n 
species has between 1 and 3 individuals sampled at each locus, then there 
are potentially 3™ sampling schemes. We would need to sample enough 
loci to ensure that each of these exponentially many sampling schemes is 
used many times for the asymptotic claim of consistency to give us much 
confidence in an analysis. 

If, on the other hand, a study is designed for 3 individuals sampled per 
species and sequenced at all loci, a small amount of missing data might 
result in only a few of the 3 n sampling schemes being used, with each used 
repeatedly. Especially if the data were missing at random, it might be 
reasonable to use the multiple-allele versions of STAR. 

As another example of potentially problematic sampling, imagine sam- 
pling a large number of loci with only one individual per taxon, and a 
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single locus with one individual for most taxa, but many individuals for one 
taxon. (Such a data set is not farfetched, especially if it were built from 
data originally collected for other studies.) Then using the standard STAR 
numbering, the tree for this last locus will have a much larger number as- 
signed to its root. If there is substantial incomplete lineage sorting between 
the single-sample taxa and the multi-sample one, then on this tree internode 
distances between coalescent events involving only single-sample taxa will 
be larger than for the other trees. Thus it appears that it might be desirable 
to downweight this tree in the distance averaging, to avoid overemphasizing 
the evidence for relationships between the single-sample taxa that it offers. 
On the other hand, if there is little incomplete lineage sorting between the 
single-sample taxa and the multi-sample one, then downweighting will re- 
sult in an undesirable deemphasis of the evidence of relationships between 
the single-sampled taxa given by this locus. It is simply not clear how the 
anomalously-sampled locus could be used without potentially biasing the 
analysis. 

We note too that for finite sample sizes, the standard STAR numbering 
scheme and its variations may have biases for tree shape. To illustrate this, 
consider the case of sampling a single individual per taxon for four taxa. 

Now suppose UPGMA is applied to the average distance table using the 
standard STAR numbering with (ao, 01,02) = (4,3,2), with ties being re- 
solved randomly. If there are exactly two input gene trees, then STAR- 
UPGMA will return a symmetric tree in exactly the following cases: 

(i) both input trees are symmetric; 

(ii) one input tree is symmetric and one is asymmetric, and they share a 
cherry (a 2-taxon clade), in which case a symmetric tree is returned 
with probability 1/2; and 

(iii) both input trees are asymmetric, and their cherries do not have 
overlapping taxa, in which case a symmetric tree is returned with 
probability 2/3. 

In case (iii), for instance, if the input trees are (((AB)C)D) and (((CD)B)A), 
then the expected distance table averaged over both loci is 



E(D T ) = 



(0 


6 


7 




6 





6 


7 


7 


6 





6 




7 


6 


V 



which results in an asymmetric tree if B and C are clustered first (which 
occurs with probability 1/3), and otherwise results in a symmetric tree. 

Assuming that gene trees evolve on an unresolved species tree, unlabeled 
gene trees have a 1/3 probability of being symmetric and 2/3 probability 
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of being asymmetric under the coalescent. The probability that STAR- 
UP GMA returns a symmetric tree in this situation is therefore 
11 1211 2212_19 1 
3'3 + '3'3'3'2 + 3'3'6'3~81 < 3' 
indicating that there is some bias against symmetric trees in this case. (To 
understand the second summand in the expression on the left, for example, 
notice that the two trees are symmetric and asymmetric with probability 
2 • (|) (|), that these two trees share a cherry with probability |, and that 
UP GMA will return the symmetric tree with probability 5.) 

While these computations show bias is present, we note that this analysis 
has discarded the edge lengths on the UPGMA tree, even though they carry 
information about when a tree is close to a different topological structure. 

Now it is possible that numbering schemes can be chosen to reduce bias 
when only finitely many loci are sampled. For example, for the numbering 
scheme (00,01,02) = (4,3,1), case (hi) always results in STAR returning 
a balanced tree, and the total probability of returning a symmetric tree is 
21/81, suggesting less bias than the standard STAR numbering, at least for 
this example. 

Whether topological bias of STAR for small samples can be reduced for 
all species trees simultaneously, or at least on average under some model of 
species tree generation, is an interesting topic for future investigation. 

5. STAR AND CLADES 

The standard node numbering scheme, in which all internal edges on gene 
trees are given length 1, produces distances on remetrized gene trees that has 
an interpretation regarding clades on gene trees. In this section we develop 
this connection. 

As a consequence, we recover a previously known result that the topolog- 
ical rooted sp ecies tree can be ide ntified from the collection of gene clade 
probabilities ( Allman et al. . 2011al ). Additionally, with this viewpoint, we 



obtain an efficient algorithm for the identification. 

This connection between STAR and clades makes clear that whatever 
information STAR uses for inference is found in the clade probabilities, and 
not in the more detailed distribution of gene trees. 

Consider a gene tree T on a collection of taxa X. By a clade on T we mean 
a subset C C X for which the set of descendants of MRCA(C) is precisely 
C. There is thus a bijection between nodes v of T and clades C, by which 
v = MRCA(C). The trivial clades are X, with MRCA at the root of the 
tree, and singleton subsets of X, with MRCAs at the leaves. The trivial 
clades occur on all gene trees on X. 

Given any two leaves a, b of T, let H a ^ denote the set of non-trivial clades 
of T which contain both a and b, and let er(o, b) = \H a ^\. Elements of H a ,b 
correspond to the non-root vertices of T lying on the path from the root 
to the MRCA({a, 6}). Thus cr(a, b) is simply the distance from the root to 
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the MRCA({a, b}) in T remetrized by the standard STAR numbering. Since 
this tree is ultrametric, with all leaves n = \X\ from the root, one finds 

fA\ r n D T (a,b) 

(4) c T (a,b)=n ±±-L. 

This shows that, at least on individual gene trees, the distances used in 
STAR are essentially counts of clades, with gene samples being judged closer 
when they occur together in more clades. 

Now fixing a species tree a, the coalescent model determines gene tree 
probabilities P CT (T). If C is a non-trivial clade, and P CT (C) denotes the prob- 
ability of clade C, then 

p -( c ) = Yl P -( T )- 

T displaying C 

Proposition 5.1. Suppose a = (ip,\) is an n-taxon rooted binary metric 
species tree. Then the clade probabilities {P CT (C)} determine E CT (Z>r). 

Proof. Define two indicator functions: 

Ic(T) = l l ^ T displays C, ^ ^ fl S a,b e C, 

I otherwise, I otherwise. 

Then using equation @, 

E a (D T (a, b)) = Y Va(T) 2(n - c T (a, b)) 

T 

= 2n-2^P a (T)c T (a,6) 

T 

= 2n-2^P ( W ]T HT)Ia,b(C)) 

X ^ non-trivial ' 

clades C 

= 2n-2 Y (T.V'WbF)) 1 **® 

non-trivial T 
clades C 

(5) = 2n-2 ]T V*(C)Ia, b (C), 



non-trivial 
clades C 



and the expected STAR distance between a and b is computable from clade 
probabilities. □ 

Remark. Though this statement and proof involve theoretical distributions 
and expected STAR distances, a similar statement and argument also apply 
to empirical clade frequencies and empirical STAR distances computed from 
a gene tree sample. 

From Theorem 13.21 we immediately obtain the following: 
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Corollary 5.2. The rooted species tree topoogy ip is identifiable from clade 
probabilities under the multispecies coalescent. 

Remark. Proposition 15.11 also suggests an efficient algorithm for computing 
ip from {P(C)}, or an estimate of tj) from estimates of {P(C)}. 

We note that the program BUCKy (|Ane et all 120071 ) uses estimated clade 



probabilities to construct a concordance tree from the clades with highest 
estimated probabilities one clade at a time using a greedy consensus ap- 
proach, Motiyated b y the observation that this method can be misleading 
(Dcgnan et al.l . l2009h . a quartet version of BUCKy was developed tha t build s 



the species tree from the most supported quartets ( Larget et al. . 2010l ). 



However, the observation that STAR distances can be computed from clade 
probabilities suggests an alternative method for computing the concordance 
tree using Equation ([5]) that would be consistent under the multispecies 
coalescent. 

6. Discussion 

The main results of this paper are that the STAR algorithm and its gen- 
eralizations to other node numbering schemes, using any well-behaved dis- 
tance method for the selection of the species tree, do provide st atistically 



consis tent methods of species tree inference. As demonstrated bv lLiu et al 



(120091 ). me method is also fast and, at least in simulations, exhibited good 



performance in comparison to some other methods of species tree inference. 

While our focus here has been theoretical, our work indicates several 
insights that may be helpful in improving practical data analysis. First, be- 
cause the expected distance tables under STAR are ultrametric, it may be 
preferable to use UPGMA rather than Neighbor Joining as the tree building 
method in STAR, as UPGMA enforces an ultrametric assumption while NJ 
does not. Alternatively, if NJ is used, then even though the reconstructed 
distances on the inferred species tree are ultimately discarded, it might be 
worthwhile to first examine them to see if they roughly give ultrametricity. 
If not, then one might doubt either the fit of the multispecies coalescent 
model to the gene tree data, or that one had a sufficiently large sample of 
gene trees for adequate inference. One should also consider using a more 
elaborate distance method based on an optimality criterion, such as Min- 
imum Evolution, although this would require a tree search and therefore 
should slow runtimes. Fast heuristic searches under such a criterion, possi- 
bly adapted to enforce ultrametricity, might improve STAR's performance. 
Simulation studies are needed to further investigate these issues. 

Second, the generalizations of the STAR method using alternate number- 
ing schemes also need further investigations. Although all such schemes lead 
to consistent inference, this is an asymptotic statement that concerns be- 
havior with large samples. It is still possible that some numbering schemes 
lead to more efficient inference, in that with a small sample of gene trees 
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they are more likely to return the correct species tree, or one close to it. 
Understanding this issue, perhaps for specific species trees, or better still 
for specific models of species tree generation, would be desirable. 

Even without more detailed theoretical results on the behavior of the 
generalized STAR methods, they might still be useful in data analysis. For 
instance, for a fixed sample of gene trees, one could repeatedly infer a species 
tree using randomly chosen node numbering schemes satisfying the criteria 
in Section [3l If all such species trees have the same topology, one might be 
more confident in the result. Significant differences in the inferred topologies 
might lead one to doubt the validity of all the trees, again due to either poor 
model fit of the coalescent or inadequate sample size. More minor variations 
in the inferred trees might suggest which features of the species tree one 
might be confident of, based on their shared characteristics. Because of 
the computational speed of STAR, calculations with many random node 
numberings should be feasible. 
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